
Speed up parsejson 3.25x (with a gcc-10.2 PGO build) on number heavy input #16055

Closed
wants to merge 16 commits

Conversation

c-blake
Contributor

@c-blake c-blake commented Nov 19, 2020

Also add a couple convenience iterators and update changelog.md.

The basic idea here is to process the ASCII just once in a combined
validation/classification/parse rather than validating/classifying a
number in `parsejson.parseNumber`, then doing the same work again in
`parseFloat`/`parseInt`, and finally doing it a third time in the C library
`strtod`/etc. The last is especially slow on some libcs/common CPUs
due to `long double`/80-bit extended precision arithmetic on each digit.
In any event, the new fully functional `parseNumber` is not actually much
more code than the old classifying-only `parseNumber`.

Because we aim to do this work just once and the output of the work is
an `int64|float`, this PR adds those exported fields to `JsonParser`. (It
also documents two existing but undocumented fields.)

One simple optimization done here is `pow10`. It uses a small lookup
table for the power-of-10 multiplier. The full 2048*8B = 16 KiB one is
bigger than is useful. In truth most numbers in any given run will
likely be of similar orders of magnitude, meaning the cache cost would
not be heavy, but it is probably best not to rely on likelihood alone. So we
fall back to a fast integer-exponentiation algorithm when out of range.
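The table-plus-fallback idea can be sketched as follows (a rough Python sketch of the concept, not this PR's Nim code; the table half-width of 22 is an illustrative assumption, not the PR's actual cutoff):

```python
# Small lookup table for common exponents; exponentiation by squaring
# as the fallback for exponents outside the table.
TABLE_RANGE = 22  # illustrative cutoff, not the PR's actual table size
_POW10 = [10.0 ** e for e in range(-TABLE_RANGE, TABLE_RANGE + 1)]

def pow10(e: int) -> float:
    if -TABLE_RANGE <= e <= TABLE_RANGE:   # common case: one array lookup
        return _POW10[e + TABLE_RANGE]
    # Fallback: square-and-multiply, O(log |e|) multiplications.
    # Negative powers accumulate a little rounding error; fine for a sketch.
    base = 0.1 if e < 0 else 10.0
    n = -e if e < 0 else e
    result = 1.0
    while n > 0:
        if n & 1:
            result *= base
        base *= base
        n >>= 1
    return result
```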

The new `parseNumber` itself is more or less a straight-line parse of
scientific notation where the '.' can be anywhere in the number. To do
the right power-of-10 scaling later, we need to bookkeep a little more
than the old `parseNumber`, as noted in side comments in the code.
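The shape of that parse can be sketched like this (a Python sketch of the idea only, not this PR's Nim implementation; error handling and the parser's field updates are omitted):

```python
# Accumulate all digits into one integer while remembering how many came
# after the '.', then apply a single power-of-10 scaling at the end.
def parse_number(s: str):
    i, sign = 0, 1
    if i < len(s) and s[i] in "+-":
        sign = -1 if s[i] == "-" else 1
        i += 1
    mantissa = 0        # all digits, '.' ignored
    frac_digits = 0     # digits seen after the '.'
    saw_point = False
    while i < len(s) and (s[i].isdigit() or s[i] == "."):
        if s[i] == ".":
            saw_point = True
        else:
            mantissa = mantissa * 10 + (ord(s[i]) - ord("0"))
            if saw_point:
                frac_digits += 1
        i += 1
    exp = 0
    if i < len(s) and s[i] in "eE":
        i += 1
        esign = 1
        if i < len(s) and s[i] in "+-":
            esign = -1 if s[i] == "-" else 1
            i += 1
        while i < len(s) and s[i].isdigit():
            exp = exp * 10 + (ord(s[i]) - ord("0"))
            i += 1
        exp *= esign
    if not saw_point and exp == 0:
        return sign * mantissa                              # integer token
    return sign * mantissa * (10.0 ** (exp - frac_digits))  # float token
```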

Note that calling code that always wants a `float`, even if the textual
representation looks like an integer, can do something like `if p.tok == tkInt: float(p.i) else: p.f` or similar, depending on context.

Note that since we form the final float as integer * powerOf10, since
powers of 10 have some intrinsic binary representational error, and since
some alternative approaches (on just some CPUs) use 80-bit floats, it is
possible for the number to mismatch those other approaches in the least
significant mantissa bit (or for the ASCII->binary->ASCII round-trip to
not be the identity). On those CPUs only, better matching results could
perhaps be obtained from an `emit` using `long double` if desired (also with
a `long double` table for powers of 10 and the powers-of-10 calculation).
This strikes me as unlikely to be truly needed long-term, though.

@c-blake
Contributor Author

c-blake commented Nov 19, 2020

I won't be able to track down what is going wrong in the tests until at least tomorrow, and maybe not until Monday, in case anyone is in a rush and wants to help. I did test it parsing a whole big file through the json module interface and in various other ways. So, it probably won't take much to get it working.

Also, here is a link back to the whole, long forum thread that inspired this.

@c-blake
Contributor Author

c-blake commented Nov 19, 2020

Also, instead of slice assignment we could probably use `my.a.setLen` and `copyMem` for a little boost in the slower `str(Integers|Floats)=true` mode.

slice assignment.  Within 1.03x of non-copy mode in a PGO build (1.10x
if the caller must save the last valid string).  One might think that
this argues for ditching `strFloats` & `strIntegers` entirely and just
always copying.  In some sense, it does.

However, the non-copy mode is a useful common case here, as shown in
`jsonTokens(string)` example code.  So, it is a bit of a judgement call
whether supporting that last 10% of perf in a useful common case matters.
@c-blake
Contributor Author

c-blake commented Nov 20, 2020

Does anyone know why I am getting "undeclared identifier: 'copyMem'"s? That seems weird.

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

Nevermind. I guess NimScript does not support copyMem. That seems kinda lame. :(

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

This is only half related to this PR, but in the json module, the new `rawIntegers` & `rawFloats` parameters are not always exposed/handled well. E.g. `parseJsonFragments` should take all 3 raws (`rawStringLiterals`, too) & pass them to `p.open`, but not to `p.parseJson`.

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

I have a question. The very end of tests/stdlib/tjson.nim converts 100 '9's into a string. 9.99…e99 is well within the range of a double, which is what JSON Numbers really are. Would it not make more sense to produce a Nim float?

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

Never mind. I am reading the long discussion. { A lot of never mind's in this one. That'll teach me to code before coffee. ;-) }

e = e shr 1
base *= base

proc parseNumber(my: var JsonParser): TokKind {.inline.} =
Collaborator

Isn't the proc kinda too big to be `{.inline.}`? 🤔

Contributor Author

@c-blake c-blake Nov 20, 2020

No. In the by far most common case it is just one line (two almost always perfectly predicted compare-jumps and one array reference).

It is only used just once, for human mental abstraction. So, there is no duplicated generated code.

Also, people are too sparing of {.inline.} in my view. Function call overhead is like 12..15 cycles * 3..6-way superscalar ≈ 36..90 dynamic instruction slots. So, if duplication is not a concern, that is the rough amount of work to compare to.

If you were really going to browbeat me over inline then you should be complaining about parseNumber being inline, not pow10. ;-) {EDIT: oh, wait, that's what you were doing! Sorry!} But many numbers to parse might be just single digits, which take only a few cycles. In fact, the more numbers you have, the more likely many are single digits. So, the function call overhead would then be quite large compared to the work done.

Contributor Author

@c-blake c-blake Nov 20, 2020

Another thing inline buys you here is putting the parseNumber C code in the same translation unit as the parsing itself. So, the C compiler's metrics/heuristics for inlining can at least be applied. Otherwise you rely on LTO, which a lot of people do not enable. Given that this whole bit of work was to get really fast speed, it seems appropriate for that reason as well as the "many single digit" cases. If you do PGO builds the compiler will actually try to measure whether inlining is good or not. According to the GCC guys that is the main benefit of PGO. I am always getting 1.5x to 2.0x speed-ups from PGO with Nim, which suggests to me that our {.inline.} choices are not generally so great (or at least a few critical choices are not...).

Contributor Author

@c-blake c-blake Nov 20, 2020

Oh, and parseNumber is also only called in one place, in getTok. So, there is still no extra L1 I-cache style resource hit, and I also added {.inline.} on getTok to get the meat of all this logic into the translation unit of whatever is using the json parser.

Contributor Author

@c-blake c-blake Nov 20, 2020

Also, Nim's {.inline.} just marks it in the generated C as inline. So, it's still up to the C/C++ compiler's code metrics what ultimately happens. Those surely do take into account the statically observable number of expansions/substitutions and their sizes. So, the many calls to getTok from json are not a real problem. But if anyone really wants me to remove them, I can. I always use (and recommend using) PGO when performance matters, myself.

this is the only remaining problem.  Before merging we should make
overflow detection more precise than just a digit count for both
`tkInt` and `tkFloat`.
@c-blake
Contributor Author

c-blake commented Nov 20, 2020

@Araq WARNING: Not ready to merge, though it passes all checks. (See final commit comment.)

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

Question probably for @Araq and all those in the prior discussion of this:


Integer overflow going from JSON to a Nim int64 (or uint64 for that matter) is well defined because there is only one way for it to happen. On the other hand, floating point is "two dimensional". A string can overflow the precision (when it must be a long string) or overflow the range (with a too big or too small exponent) or both. Exponent overflow is pretty ordinary and unproblematic. Both exponent and integer can probably even have little 10-entry digit-indexed tables to check pre-multiply.
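For the integer side, the digit-indexed pre-multiply check could look something like this (a Python sketch of the idea only; the names are mine, not this PR's):

```python
# THRESHOLDS[d] is the largest accumulator value for which acc*10 + d
# still fits in a signed 64-bit integer, so a single compare per digit
# detects overflow before it can happen.
INT64_MAX = 2**63 - 1
THRESHOLDS = [(INT64_MAX - d) // 10 for d in range(10)]

def accumulate(acc: int, d: int):
    """Fold digit d into acc; return (new_acc, overflowed)."""
    if acc > THRESHOLDS[d]:
        return acc, True
    return acc * 10 + d, False
```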

The precision dimension seems tricky, though. I know that sometimes people format floats to 18..20 digits just to try to get better round-trips through 80-bit FP formats. I don't know how frequently JSON libs in other programming languages do this, though. Does anyone know? I avoid JSON due to it being atrociously 30-200x slower than binary formats and consequently don't have a lot of experience with the JSON libs. If they never emit more than 16 digits then there is no problem.

If JSON libs frequently (or ever) emit an extra digit or three, then it seems a bit overboard to have 17..20 decimal digits alone launch things into arbitrary-precision float arithmetic on the Nim receiver. I guess someone could hook into a preprocessor that truncated/rounded extra digits, but blech. Ultimately, if one doesn't really know how the ASCII was generated, then client code needs to provide guidance for this "auto promotion to JString/arb.prec float" idea. So, it seems like reason enough for yet another new parameter, maxNativeFloatDecimals. However, these interfaces are getting pretty busy/crowded.

So, it seemed like asking how people would like this handled was warranted. With the below idea a set[enum] is starting to smell more user friendly. We're only like 3 weeks into this raw* era. So, I am taking it as maybe still changeable.


Also, it seems like one thing that never came up on that overflow discussion was the idea of storing the uint64 in an int64 via cast[int64]() but setting a flag in the parser so that client code knows it should cast[uint64]() back. I'm not sure this is the best answer, but it wasn't really discussed and also seems to solve everyone's issues - performance "right at 64-bits unsigned" for @timotheecour, backward compat for @disruptek et al.

This would mean client code expecting ints in the range 2^63..2^64-1 needs to check a bool flag and maybe do a cast[uint64](num) (but client code may be casting unconditionally already if it is expecting uint64).

Now, I know your first reaction is probably that converting 2^63 into -2^63 seems like a keeerazy footgun. However, if I understand correctly, before, an exception was just raised for this data. So, client code had to try/except/do something special anyway to not fail. This would just change what it needs to do to not fail. Another new bool flag could also control whether you get the old exception or the new cast behavior. Best of all worlds?
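The bit-pattern round-trip itself is lossless. In Python, with masking standing in for Nim's cast[int64]/cast[uint64] (an illustrative sketch of the idea, not this PR's code), it looks like:

```python
# Store a uint64 value in an int64 slot by reinterpreting the bit
# pattern, and set a flag when the value exceeded int64 range so
# client code knows to cast back.
U64 = 1 << 64
I63 = 1 << 63

def store_as_int64(u: int):
    """Return (int64_view, needs_unsigned_flag) for a uint64 value u."""
    assert 0 <= u < U64
    overflowed = u >= I63
    return (u - U64 if overflowed else u), overflowed   # two's-complement view

def load_uint64(i: int) -> int:
    """Recover the original uint64 bit pattern (the cast[uint64] back)."""
    return i & (U64 - 1)
```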

I will probably hold off making these overflow conditions more precise until Monday/I hear back from various folks on these questions. Alternatively, I am also fine if you want to merge this & let the imprecise test stand momentarily while someone else works on said overflow checking and/or API issues.

@disruptek
Contributor

The bool flag on a parseJson makes sense to me, but I'd make it a proper options = set[enum] so we can extend it further going forward. I trust whatever you come up with here; it seems like this code is moving in the right direction -- thank you. 👍

@c-blake
Contributor Author

c-blake commented Nov 20, 2020

Also, if "ever storing" the uint64 in an int64 bothers anyone too much, we could add a new field to JsonParser that just mirrors the bit pattern in all cases. I think then cast could be avoided and users expecting uint64s could just access that field (or call getUint). We would still probably want an options: set[enum] parameter instead of 5 bool parameters. This seems close to what @timotheecour & @krux02 were lobbying for without breaking old code.

@c-blake
Contributor Author

c-blake commented Nov 22, 2020

Well, if I get no more feedback on the future direction here, I will just do something tomorrow. I'm a little surprised, as the prior discussion seemed pretty passionate.

finally:
p.close()

iterator jsonTokens*(path: string, rawStringLiterals = false,
Member

These API additions are not good for two reasons:

  • An iterator that always produces the same value is a foreign concept.
  • It looks like they are unused. :-)

Contributor Author

They're unused because the idea was to provide something like json.parseJsonFragments but in the raw parsing parsejson context. I think they are also not normal in yielding a ptr JsonParser, but that actually made something like a 1.5x performance difference with the new faster logic, because the JsonParser object is big enough that the alloc/copies showed up at the top of the profile. It mostly seemed the easiest way to really streamline the usage syntax, as shown here. I'm open to alternative suggestions to streamline said syntax, or to just removing them and leaving them to the user, though.

I was more concerned about your feedback about int64 but not uint64 overflow ideas, going down the options = set[enum] route and adding a new u: uint64 field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction. That's much more to do/get right than, say, moving those iterators to example code/doing something else. So, if you hated the general direction I wanted to hold off on that work.

Member

The return type could be lent T but I don't know if today's compiler accepts that (it really shouldn't without a spec addition...).

For now please remove these iterators, it's easy enough to write a open/close/loop snippet.

I was more concerned about your feedback about int64 but not uint64 overflow ideas, going down the options = set[enum] route and adding a new u: uint64 field for users to conditionally/unconditionally access when they maybe/definitely expect unsigned, and that sort of direction.

I would use a single int64 field and then accessor procs/templates that hide the cast[uint64](x). And options = set[enum] is the better design, yes. I couldn't use it because of backwards-compatibility reasons. There is a different solution that deals with the compat problems, but it's annoying to do; maybe it's time to make .since a builtin that works better.

@planetis-m
Contributor

Can this be disabled though? In the case of packedjson there is no reason to parse the number, it's stored as a string.

@c-blake
Contributor Author

c-blake commented Nov 24, 2020

Yes. We can add a path based on rawIntegers|rawFloats (passed through) that only validates/frames the numeric part syntactically without building a binary answer. That is a good idea and not too hard. Thumbs up.

At the moment, though, I am working on the trickier problem of detecting arbitrary precision float overflow/underflow. :-) (The integer case is easy. glibc actually uses a full bignum library to parse floats in strtod() which is also why it is so slow.)

@c-blake
Contributor Author

c-blake commented Nov 24, 2020

Do not merge this. I am working on a cleaner and more general approach in light of edge cases.

Vindaar added a commit to Vindaar/ggplotnim that referenced this pull request Dec 27, 2020
This uses @c-blake's code from
nim-lang/Nim#16055
to avoid having to copy the string which we parse and instead directly
parses a number into either an int or a string.

Note that this implementation has some edge cases that break it. In
particular a number that starts valid, but doesn't end as a valid
number (which is technically ok, but will break our CSV parser, since
our position in the memfile will *not* be on the next separator!).
Vindaar added a commit to Vindaar/ggplotnim that referenced this pull request Dec 30, 2020
(same commit message as above)
@FedericoCeratto
Member

Related issues: #12833 #3809 #12152 nim-lang/RFCs#188

@stale

stale bot commented Jun 4, 2022

This pull request has been automatically marked as stale because it has not had recent activity. If you think it is still a valid PR, please rebase it on the latest devel; otherwise it will be closed. Thank you for your contributions.

@stale stale bot added the stale Staled PR/issues; remove the label after fixing them label Jun 4, 2022
@stale stale bot closed this Jul 7, 2022
Labels
Performance stale Staled PR/issues; remove the label after fixing them

7 participants